Word Selection for EBMT based on Monolingual Similarity and Translation Confidence
نویسندگان
چکیده
We propose a method of constructing an example-based machine translation (EBMT) system that exploits a content-aligned bilingual corpus. First, the sentences and phrases in the corpus are aligned across the two languages, and the pairs with high translation confidence are selected and stored in the translation memory. Then, for a given input sentences, the system searches for fitting examples based on both the monolingual similarity and the translation confidence of the pair, and the obtained results are then combined to generate the translation. Our experiments on translation selection showed the accuracy of 85% demonstrating the basic feasibility of our approach.
منابع مشابه
Automated Generalization of Translation Examples
Previous work has shown that adding generalization of the examples in the corpus of an example-based machine translation (EBMT) system can reduce the required amount of pretranslated example text by as much as an order of magnitude for Spanish-English and FrenchEnglish EBMT. Using word clustering to automatically generalize the example corpus can provide the majority of this improvement for Fre...
متن کاملImproving Translation Selection with a New Translation Model Trained by Independent Monolingual Corpora
We propose a novel statistical translation model to improve translation selection of collocation. In the statistical approach that has been popularly applied for translation selection, bilingual corpora are used to train the translation model. However, there exists a formidable bottleneck in acquiring large-scale bilingual corpora, in particular for language pairs involving Chinese. In this pap...
متن کاملBuilding a Monolingual Parallel Corpus for Text Simplification Using Sentence Similarity Based on Alignment between Word Embeddings
Methods for text simplification using the framework of statistical machine translation have been extensively studied in recent years. However, building the monolingual parallel corpus necessary for training the model requires costly human annotation. Monolingual parallel corpora for text simplification have therefore been built only for a limited number of languages, such as English and Portugu...
متن کاملAdaptation Using Out-of-Domain Corpus within EBMT
In order to boost the translation quality of EBMT based on a small-sized bilingual corpus, we use an out-of-domain bilingual corpus and, in addition, the language model of an indomain monolingual corpus. We conducted experiments with an EBMT system. The two evaluation measures of the BLEU score and the NIST score demonstrated the effect of using an out-of-domain bilingual corpus and the possibi...
متن کاملThe influence of example-data homogeneity on EBMT quality
Homogeneity of large corpora is still a largely unclear notion. In this study we first make a link between the notions of similarity and homogeneity : a large corpus is made of sets of documents to which may be assigned a score in similarity defined by cross-entropic measures, such similarity being implicitly expressed in the data. The distribution of the similarity scores of such subcorpora ma...
متن کامل